{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 5.2 语音识别\n",
    "\n",
    "语音识别，又称自动语音识别（ASR），是一种将口语转换为书面格式或根据口头命令执行特定操作的技术。它涉及分析语音模式、语音学和语言结构的机器学习模型，以准确地转录和理解人类语音。\n",
    "\n",
    "[Whisper](https://openai.com/research/whisper) 是一个由 OpenAI 发布的流行的开源模型，用于 ASR 和语音翻译。Whisper 能够转录多种语言的语音，并将这些语言翻译成英语。\n",
    "\n",
    "由于 Whisper 的底层是基于 Transformer 的编码器-解码器架构，因此可以使用 BigDL-LLM INT4 优化功能对其进行有效优化。在本教程中，我们将指导您在 BigDL-LLM 优化的 Whisper 模型上构建一个语音识别应用程序，该应用程序可以将音频文件转录/翻译为文本。\n",
    "\n",
    "## 5.2.1 安装程序包\n",
    "\n",
    "如果您还没有设置环境，请先按照 [第二章](../ch_2_Environment_Setup/README.md) 中的说明进行设置。然后安装 bigdl-llm："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install bigdl-llm[all]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于需要处理音频文件，您还需要安装用于音频分析的 `librosa` 软件包。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U librosa"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.2 下载音频文件\n",
    "\n",
    "首先，让我们准备一些音频文件。例如，您可以从支持多语言的 [common_voice](https://huggingface.co/datasets/common_voice/viewer/en/train) 数据集中下载音频文件。这里我们随机选择了一个英文音频文件和一个中文音频文件。您可以根据自己的喜好选择不同的音频文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "!wget -O audio_en.mp3 https://datasets-server.huggingface.co/assets/common_voice/--/en/train/5/audio/audio.mp3\n",
    "!wget -O audio_zh.mp3 https://datasets-server.huggingface.co/assets/common_voice/--/zh-CN/train/2/audio/audio.mp3"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "您可以播放下载完成的音频："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "\n",
    "IPython.display.display(IPython.display.Audio(\"audio_en.mp3\"))\n",
    "IPython.display.display(IPython.display.Audio(\"audio_zh.mp3\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.3 加载预训练好的 Whisper 模型\n",
    "\n",
    "现在，让我们加载一个经过预训练的 Whisper 模型，例如 [whisper-medium](https://huggingface.co/openai/whisper-medium) 。OpenAI 发布了各种尺寸的预训练 Whisper 模型（包括 [whisper-small](https://huggingface.co/openai/whisper-small)、[whisper-tiny](https://huggingface.co/openai/whisper-tiny) 等），您可以选择最符合您要求的模型。\n",
    "\n",
    "您只需在 `bigdl-llm` 中使用单行 `transformers`-style API，即可加载具有 INT4 优化功能的 `whisper-medium`（通过指定 `load_in_4bit=True`），如下所示。请注意，对于 Whisper，我们使用了 `AutoModelForSpeechSeq2Seq` 类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq\n",
    "\n",
    "model = AutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path=\"openai/whisper-medium\",\n",
    "                                                  load_in_4bit=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.4 加载 Whisper Processor\n",
    "\n",
    "无论是音频预处理还是将模型输出从标记转换为文本的后处理，我们都需要 Whisper Processor。您只需使用官方的 `transformers` API 加载 `WhisperProcessor` 即可："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import WhisperProcessor\n",
    "\n",
    "processor = WhisperProcessor.from_pretrained(pretrained_model_name_or_path=\"openai/whisper-medium\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.5 转录英文音频\n",
    "\n",
    "使用带有 INT4 优化功能的 BigDL-LLM 优化 Whisper 模型并加载 Whisper Processor 后，就可以开始通过模型推理转录音频了。\n",
    "\n",
    "让我们从英语音频文件 `audio_en.mp3` 开始。在将其输入 Whisper Processor 之前，我们需要从原始语音波形中提取序列数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import librosa\n",
    "\n",
    "data_en, sample_rate_en = librosa.load(\"audio_en.mp3\", sr=16000)"
   ]
  },
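  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, you can confirm that the sampling rate passed to `librosa.load` matches the one the processor's feature extractor expects. This is an optional sketch that assumes the `processor` from section 5.2.4 is already loaded; `sampling_rate` is a standard attribute of `WhisperFeatureExtractor`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The feature extractor records the sampling rate it expects,\n",
    "# which should match the rate we passed to librosa.load above\n",
    "expected_sr = processor.feature_extractor.sampling_rate\n",
    "print(expected_sr)\n",
    "assert sample_rate_en == expected_sr"
   ]
  },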
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **注意**\n",
    ">\n",
    "> 对于 `whisper-medium`，其 `WhisperFeatureExtractor`（`WhisperProcessor` 的一部分）默认使用 16,000Hz 采样率从音频中提取特征。关键的是要用模型的 `WhisperFeatureExtractor` 以采样率加载音频文件，以便精确识别。\n",
    "\n",
    "然后，我们就可以根据序列数据转录音频文件，使用的方法与使用官方的 `transformers` API 完全相同："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inference time: xxxx s\n",
      "-------------------- English Transcription --------------------\n",
      "[' Book me a reservation for mid-day at French Camp Academy.']\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import time\n",
    "\n",
    "# 定义任务类型\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"english\", task=\"transcribe\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    # 为 Whisper 模型提取输入特征\n",
    "    input_features = processor(data_en, sampling_rate=sample_rate_en, return_tensors=\"pt\").input_features\n",
    "\n",
    "    # 为转录预测 token id\n",
    "    st = time.time()\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)\n",
    "    end = time.time()\n",
    "\n",
    "    # 将 token id 解码为文本\n",
    "    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print(f'Inference time: {end-st} s')\n",
    "    print('-'*20, 'English Transcription', '-'*20)\n",
    "    print(transcribe_str)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **注意**\n",
    ">\n",
    "> `forced_decoder_ids` 为不同语言和任务（转录或翻译）定义上下文 token 。如果设置为 `None`，Whisper 将自动预测它们。\n",
    "\n",
    "\n",
    "## 5.2.6 转录中文音频并翻译成英文\n",
    "\n",
    "现在把目光转向中文音频 `audio_zh.mp3`。Whisper 可以转录多语言音频，并将其翻译成英文。这里唯一的区别是通过 `forced_decoder_ids` 来定义特定的上下文 token："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inference time: xxxx s\n",
      "-------------------- Chinese Transcription --------------------\n",
      "['制作时将各原料研磨']\n",
      "Inference time: xxxx s\n",
      "-------------------- Chinese to English Translation --------------------\n",
      "[' When making the dough, grind the ingredients.']\n"
     ]
    }
   ],
   "source": [
    "# 提取序列数据\n",
    "data_zh, sample_rate_zh = librosa.load(\"audio_zh.mp3\", sr=16000)\n",
    "\n",
    "# 定义中文转录任务\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"chinese\", task=\"transcribe\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors=\"pt\").input_features\n",
    "    st = time.time()\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)\n",
    "    end = time.time()\n",
    "    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print(f'Inference time: {end-st} s')\n",
    "    print('-'*20, 'Chinese Transcription', '-'*20)\n",
    "    print(transcribe_str)\n",
    "\n",
    "# 定义中文转录以及翻译任务\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"chinese\", task=\"translate\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors=\"pt\").input_features\n",
    "    st = time.time()\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)\n",
    "    end = time.time()\n",
    "    translate_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print(f'Inference time: {end-st} s')\n",
    "    print('-'*20, 'Chinese to English Translation', '-'*20)\n",
    "    print(translate_str)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.7 后续学习\n",
    "\n",
    "在接下来的章节中，我们将探讨如何将 BigDL-LLM 与 langchain 结合使用，langchain 是一个专为使用语言模型开发应用程序而设计的框架。有了 langchain 集成，应用程序开发过程就可以变得简单。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
